CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes by Jens-G · Pull Request #4813 · apache/cassandra

Jens-G · 2026-05-15T22:45:34Z

Summary

COPY TO followed by COPY FROM corrupts text column values that contain backslashes: each round-trip doubles the backslash count. Reported in CASSANDRA-21131.

Before (one round-trip):

Stored: V\S → exported CSV: V\\\\S → re-imported: V\\S ❌
Stored: \"Marianne"\ → re-imported: \\"Marianne"\\ ❌

list<text>, set<text>, map<text,text>, tuples and UDTs with text fields are affected in the same way.

Root Cause

format_value_text in formatting.py doubles backslashes unconditionally:

escapedval = val.replace('\\', '\\\\')

This is intentional for terminal display (SELECT output shows V\\S so the backslash is visible). However, ExportProcess.format_value in copyutil.py calls the same function when writing CSV. The csv.writer (configured with escapechar='\\') then escapes backslashes a second time, quadrupling them in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra.

Fix

Add an escape_backslash parameter (default True, preserving existing terminal display behaviour) to format_value_text, format_simple_collection, and all collection formatters. Pass escape_backslash=False from ExportProcess.format_value so the csv.writer handles all backslash escaping exclusively.

Changed functions:

format_value_text — new parameter
format_simple_collection — new parameter, propagated to element format_value calls
format_value_list, format_value_set, format_value_tuple — new parameter, forwarded to format_simple_collection
format_value_map — new parameter, propagated through subformat
format_value_utype — new parameter, propagated through format_field_value
ExportProcess.format_value in copyutil.py — passes escape_backslash=False

Test Plan

Two standalone Python test scripts (no running Cassandra cluster required) are attached to the JIRA ticket and verify the bug and fix:

test_cassandra_21131.py — 10 test cases for plain text columns: 5/10 pass before fix → 10/10 after
test_cassandra_21131_collections.py — 12 test cases for list/set/map<text>: 3/12 before → 12/12 after

Integration testing against a live cluster with the exact scenario from the bug report (COPY TO → TRUNCATE → COPY FROM → SELECT) is needed before merge.

Notes

A separate but related bug (UNICODE_CONTROLCHARS_RE converting control chars like \n to repr-notation \\n during CSV export) was discovered and will be tracked in a separate ticket.
The Generated-by: commit token is included per ASF generative tooling policy. The fix was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under human review and direction. All code has been verified manually.

🤖 Generated with Claude Code

…ng backslashes format_value_text in formatting.py doubles backslashes for terminal display (so SELECT output renders them visibly). When used via ExportProcess.format_value for COPY TO, this pre-escaping is applied before csv.writer runs its own backslash escaping (escapechar='\\'), resulting in quadrupled backslashes in the CSV file. On COPY FROM the csv.reader unescapes once, leaving doubled backslashes in Cassandra — data corruption that compounds on every round-trip. The fix adds an escape_backslash parameter (default True, preserving existing terminal display behaviour) and passes escape_backslash=False from the CSV export path in ExportProcess.format_value. The parameter is propagated through format_simple_collection, format_value_list/set/tuple/map, and format_value_utype so that collection types (list<text>, set<text>, map<text,text>, UDTs) are covered as well. Generated-by: Claude Sonnet 4.6 (Anthropic) with human review and direction

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

bschoening · 2026-05-18T15:22:59Z

@Jens-G It seems the exporter is sending values through the display formatter, which doubles backslashes for human-readable SELECT output, before handing them to the CSV writer. Claude suggests just stop using the display formatter for CSV export of text in copyutil.py. What would you think of that approach?

def format_value(self, val, cqltype):
    if val is None or val == EMPTY:
        return format_value_default(self.nullval, colormap=NO_COLOR_MAP)

    # Text-like values: pass through unmodified. The csv.writer will apply
    # CSV-level escaping using the dialect's escapechar. Running these
    # through the display formatter would double-escape backslashes, see
    # CASSANDRA-21131.
    if cqltype.type_name in ('text', 'varchar', 'ascii'):
        return val
    ...

Jens-G · 2026-05-18T16:43:10Z

TBH if it works I'm fine with either approach 👍 🚀

bschoening

Consider an alternative approach where formated_value() bypasses formatting for text

if cqltype.type_name in ('text', 'varchar', 'ascii'):
return val

bschoening · 2026-05-18T16:55:16Z

        formatted = formatter(val, cqltype=cqltype,
                              encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
                              float_precision=cqltype.precision, nullval=self.nullval, quote=False,
+                              escape_backslash=False,


Consider an alternative approach where formatted_value() bypasses display formatting for text.

format_value(): ... if cqltype.type_name in ('text', 'varchar', 'ascii'): return val if val.isprintable() else None

bschoening · 2026-05-18T16:55:41Z

+                              escape_backslash=False,
                              decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
                              boolean_styles=self.boolean_styles)
        return formatted


Please add unit tests

arvindKandpal-ksolves · 2026-06-03T01:13:47Z

Hi @Jens-G , if you are busy I can add the unit tests for this and you can review that — just let me know if that works. ?

bschoening · 2026-06-03T22:10:54Z

@Jens-G the PR needs additional work. Based on our discuss of RFC 4180, proper text values don't require formatting. In format_value():

if cqltype.type_name in ('text', 'varchar', 'ascii'):
    return val if val.isprintable() else None

that may eliminate the need to have a new argument for escape_backslash

… handling Add TestExportFormatValue to test_copyutil.py, covering ExportProcess.format_value: - scalar text/varchar/ascii values keep single backslashes on export - text inside list/set/map/tuple keeps single backslashes (the collection formatters propagate escape_backslash, which a scalar-only type check misses) - backslashes survive a csv.writer -> csv.reader round-trip with the COPY dialect - the terminal-display path still doubles backslashes (escape_backslash default), preserving SELECT rendering The tests were confirmed to fail when the escape_backslash=False export change is reverted, so they guard against regressing the fix. Generated-by: Claude Opus 4.8 (Anthropic) with human review and direction Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Jens-G · 2026-06-04T08:21:45Z

Thanks @bschoening! I worked through the scalar fast-path. It's a clean simplification for text/varchar/ascii, but I don't think it lets us drop the escape_backslash argument — here's the case it misses.

The ticket also covers list<text>, set<text>, map<text,text>, tuples and UDTs. Those aren't serialized as scalars: format_value_list/_set/_map/_utype format each element by routing it back through format_value_text (with quote=True). A cqltype.type_name in ('text','varchar','ascii') check in format_value() never sees them, so their elements still get their backslashes doubled. Since collection elements always pass through format_value_text, the no-double-escape signal has to live on that function regardless of what we do at the scalar level — so the argument can't be fully removed without dropping collection support.

I've added TestExportFormatValue in this PR to pin this down. The collection cases serialize list/set/map/tuple<text> through the export path and assert single backslashes; they fail (doubled) if the escape_backslash=False change is reverted — which is effectively what a scalar-only check does to collections. (On import there's no compensation either: the elements are CQL string literals, and dequote_value only un-doubles single quotes, not backslashes.)

On return val if val.isprintable() else None: with the parameter approach it's moot, but for the record else None would silently null the whole value for any text containing a newline or tab (it becomes an empty field). I understand the intent of keeping control characters out of the CSV — but that's separate from the backslash doubling, and format_value_text rewrites control chars via UNICODE_CONTROLCHARS_RE either way, so I'd keep it out of this ticket and handle it in the UNICODE_CONTROLCHARS_RE follow-up I mentioned.

Happy to add a scalar fast-path on top of the parameter if you'd like it for readability — as an addition rather than a replacement. WDYT?

Jens-G · 2026-06-04T08:22:13Z

@arvindKandpal-ksolves thanks a lot for offering to help! I've gone ahead and added the unit tests in this PR (TestExportFormatValue in test_copyutil.py), so this one is covered. Much appreciated. 🙏

bschoening · 2026-06-04T16:44:54Z

@Jens-G Based on what we learned with CASSANDRA-21381, we should broaden the scope here to handle all TEXTDATA properly.

COPY TO should allow roundtrip all TEXTDATA. Backslash is just another character in the %x2D-7E range - a perfectly ordinary TEXTDATA character with no special status so escape_backslash as an option isn't needed. The only escape production is DQUOTE.

format_value: agreed, let's fast-path it — skip formatting for printable text and sanitize only the non-printable cases..
control characters: agree that else None is too aggressive. Text containing TAB, CR, LF should be enclosed in double quotes (the CSV escaped form), in a sanitize field function. The remaining C0/C1 controls are outside the scope of RFC 4180; rather than support round-tripping them, I'd suggest a one-way, percent-encoded translation (%00, %1B, …) — non-reversible by design, and less easily confused with Python's own backslash escaping.
CSV escaping implies that COPY FROM would revert DQUOTE escapes
collections: agreed -- list, set, map<text,text> need the same sanitization and should be in scope here.

BTW, sineemore/csv-test-data has test fixtures we could cherry-pick for unit tests.

Relevant RFC 4180 grammer:

field = (escaped / non-escaped)

escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE

non-escaped = *TEXTDATA

TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

DQUOTE = %x22

bschoening requested a review from Copilot May 18, 2026 13:28

Copilot started reviewing on behalf of bschoening May 18, 2026 13:28 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

bschoening requested changes May 18, 2026

View reviewed changes

bschoening mentioned this pull request May 19, 2026

CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values #4819

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes#4813

CASSANDRA-21131: Fix CSV COPY TO/FROM corrupting text values containing backslashes#4813
Jens-G wants to merge 2 commits into
apache:trunkfrom
Jens-G:jens-g/CASSANDRA-21131/trunk

Jens-G commented May 15, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

bschoening commented May 18, 2026 •

edited

Loading

Uh oh!

Jens-G commented May 18, 2026

Uh oh!

bschoening left a comment •

edited

Loading

Uh oh!

bschoening May 18, 2026 •

edited

Loading

Uh oh!

bschoening May 18, 2026

Uh oh!

arvindKandpal-ksolves commented Jun 3, 2026 •

edited

Loading

Uh oh!

bschoening commented Jun 3, 2026 •

edited

Loading

Uh oh!

Jens-G commented Jun 4, 2026

Uh oh!

Jens-G commented Jun 4, 2026

Uh oh!

bschoening commented Jun 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

Jens-G commented May 15, 2026

Summary

Root Cause

Fix

Test Plan

Notes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

bschoening commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jens-G commented May 18, 2026

Uh oh!

bschoening left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bschoening May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bschoening May 18, 2026

Choose a reason for hiding this comment

Uh oh!

arvindKandpal-ksolves commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bschoening commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Jens-G commented Jun 4, 2026

Uh oh!

Jens-G commented Jun 4, 2026

Uh oh!

bschoening commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bschoening commented May 18, 2026 •

edited

Loading

bschoening left a comment •

edited

Loading

bschoening May 18, 2026 •

edited

Loading

arvindKandpal-ksolves commented Jun 3, 2026 •

edited

Loading

bschoening commented Jun 3, 2026 •

edited

Loading

bschoening commented Jun 4, 2026 •

edited

Loading